Research Questions

Data Origin

I retrieved the data from the OECD website, where I downloaded 2018 migration information for all OECD countries. I then retrieved a dataset from GitHub that links countries to geographical subregions.

library(readr)
library(here)

wf_mig = read_csv(here("data-raw/workforce-migration.csv")) #migration data
sub_reg = read_csv(here("data-raw/subregions.csv")) #subregions data

The workforce migration dataset includes the number of foreign-trained doctors who, in 2018, were registered or in the process of gaining registration to practise in a country other than the one in which they had obtained their medical qualifications. This includes medical interns and residents.

head(wf_mig)
## # A tibble: 6 x 11
##   COU   Country   VAR   Variable        CO2   `Country of ori~   YEA  Year Value
##   <chr> <chr>     <chr> <chr>           <chr> <chr>            <dbl> <dbl> <dbl>
## 1 CAN   Canada    FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018     5
## 2 FRA   France    FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018    12
## 3 DEU   Germany   FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018   153
## 4 NZL   New Zeal~ FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018     0
## 5 NOR   Norway    FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018    22
## 6 CHE   Switzerl~ FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018     6
## # ... with 2 more variables: Flag Codes <lgl>, Flags <lgl>

The subregions dataset links countries with their respective geographical regions and subregions. I focused on subregions in my analysis because their number is more manageable than that of countries (far too many!) or regions (far too few!), which will enhance the readability of the plot.

head(sub_reg)
## # A tibble: 6 x 11
##   name      `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region` 
##   <chr>     <chr>     <chr>     <chr>          <chr>        <chr>  <chr>        
## 1 Afghanis~ AF        AFG       004            ISO 3166-2:~ Asia   Southern Asia
## 2 Åland Is~ AX        ALA       248            ISO 3166-2:~ Europe Northern Eur~
## 3 Albania   AL        ALB       008            ISO 3166-2:~ Europe Southern Eur~
## 4 Algeria   DZ        DZA       012            ISO 3166-2:~ Africa Northern Afr~
## 5 American~ AS        ASM       016            ISO 3166-2:~ Ocean~ Polynesia    
## 6 Andorra   AD        AND       020            ISO 3166-2:~ Europe Southern Eur~
## # ... with 4 more variables: intermediate-region <chr>, region-code <chr>,
## #   sub-region-code <chr>, intermediate-region-code <chr>

Data Processing

You can see the full script here for a detailed explanation of all data processing steps taken.

I wanted my final dataset to include subregions of origin, destination subregions, and number of migrants. This required me to join wf_mig and sub_reg by ISO Alpha-3 codes. These are standardised codes for countries, and thus unique identifiers that function as joining keys.

Removing unnecessary data

I removed:

  • flows whose origin and destination were the same country (I was not interested in domestic migration)
  • unnecessary columns, e.g., the year of migration (since all data is from 2018) or information about regions (since I was only interested in subregions).

I also gave the columns shorter, more descriptive names.

library(dplyr)

#wf_mig - keep relevant columns and rename them
wf_mig = wf_mig[, c("COU", "Country", "CO2", "Country of origin", "Value")]
wf_mig = wf_mig %>%
  rename(code_to = "COU",
         country_to = "Country",
         code_from = "CO2",
         country_from = "Country of origin",
         number = "Value"
)

#wf_mig - remove domestic migration
wf_mig = wf_mig %>%
  filter(country_to != country_from)

#sub_reg - keep relevant columns and rename them
sub_reg = sub_reg[, c("name", "alpha-3", "sub-region")]
sub_reg = sub_reg %>%
  rename(country = name,
         code = "alpha-3",
         subregion = "sub-region")

Joining data frames

I allocated subregions to each country by joining wf_mig and sub_reg by country code. I also provided helpful names to distinguish between origin and destination subregions.

#join datasets based on country code
data = left_join(wf_mig, sub_reg, 
                 by = c("code_to" = "code")
) #add subregions for destination countries

data = rename(data, subregion_to =  subregion) #destination subregion

data = left_join(data, sub_reg, 
                 by = c("code_from" = "code")
) #add subregions to origin countries

data = rename(data, subregion_from = subregion) #subregion of origin
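
A left join fills in NA for any code that has no match in the lookup table, so it can be worth checking the keys before joining. The following is a minimal sketch on made-up toy data: the data frames toy_mig and toy_sub and the code "XXX" are hypothetical, and base R's merge() stands in for left_join() so that the sketch runs without extra packages.

```r
#toy stand-ins for wf_mig and sub_reg (hypothetical values)
toy_mig = data.frame(code_to   = c("CAN", "FRA", "DEU"),
                     code_from = c("AFG", "AFG", "XXX"), #"XXX" has no match
                     number    = c(5, 12, 153))
toy_sub = data.frame(code      = c("AFG", "CAN", "FRA", "DEU"),
                     subregion = c("Southern Asia", "Northern America",
                                   "Western Europe", "Western Europe"))

#codes that appear in the migration data but not in the lookup table
unmatched = setdiff(c(toy_mig$code_to, toy_mig$code_from), toy_sub$code)
unmatched

#left join on the origin code; rows without a match get NA in subregion
joined = merge(toy_mig, toy_sub, by.x = "code_from", by.y = "code", all.x = TRUE)
sum(is.na(joined$subregion))
```

An empty unmatched vector means every country code will receive a subregion.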

Evaluating the final dataset

The final dataset had no missing values.

sapply(data, 
       function(x) sum(is.na(x))
)
##        code_to     country_to      code_from   country_from         number 
##              0              0              0              0              0 
##   subregion_to subregion_from 
##              0              0
head(data)
## # A tibble: 6 x 7
##   code_to country_to  code_from country_from number subregion_to  subregion_from
##   <chr>   <chr>       <chr>     <chr>         <dbl> <chr>         <chr>         
## 1 CAN     Canada      AFG       Afghanistan       5 Northern Ame~ Southern Asia 
## 2 FRA     France      AFG       Afghanistan      12 Western Euro~ Southern Asia 
## 3 DEU     Germany     AFG       Afghanistan     153 Western Euro~ Southern Asia 
## 4 NZL     New Zealand AFG       Afghanistan       0 Australia an~ Southern Asia 
## 5 NOR     Norway      AFG       Afghanistan      22 Northern Eur~ Southern Asia 
## 6 CHE     Switzerland AFG       Afghanistan       6 Western Euro~ Southern Asia

Data Transformations

The code for this work is here.

The visualization I planned followed the procedure of Sander et al. (2014) and required two objects:

The flow matrix

I created the subregions data frame, containing the total number of migrants per subregion of origin and destination, irrespective of country:

library(dplyr)
library(reshape2)

#get number of immigrants/emigrants at subregion level
subregions = data %>%
  group_by(subregion_from, subregion_to) %>%
  summarize(subregion_number = sum(number))

#convert subregions data frame into wide format
subregions = dcast(subregions,
                   subregion_from ~ subregion_to, #origin subregions as rows
                   value.var = "subregion_number"
) #convert into wide format

#set rownames to subregion_from to facilitate indexing
rownames(subregions) = subregions$subregion_from 

head(subregions)
##                                                  subregion_from
## Australia and New Zealand             Australia and New Zealand
## Central Asia                                       Central Asia
## Eastern Asia                                       Eastern Asia
## Eastern Europe                                   Eastern Europe
## Latin America and the Caribbean Latin America and the Caribbean
## Melanesia                                             Melanesia
##                                 Australia and New Zealand Eastern Europe
## Australia and New Zealand                            2689              1
## Central Asia                                            3             95
## Eastern Asia                                          756             32
## Eastern Europe                                        182           7343
## Latin America and the Caribbean                        69             16
## Melanesia                                              67             NA
##                                 Latin America and the Caribbean
## Australia and New Zealand                                     2
## Central Asia                                                  2
## Eastern Asia                                                  1
## Eastern Europe                                              113
## Latin America and the Caribbean                           10674
## Melanesia                                                    NA
##                                 Northern America Northern Europe
## Australia and New Zealand                    697             941
## Central Asia                                  26              95
## Eastern Asia                                 527             336
## Eastern Europe                              1699           11632
## Latin America and the Caribbean             2064             900
## Melanesia                                      4               5
##                                 Southern Europe Western Asia Western Europe
## Australia and New Zealand                     1           85             44
## Central Asia                                  2           69            442
## Eastern Asia                                 NA           NA            539
## Eastern Europe                               92        10892          24244
## Latin America and the Caribbean               6          652           2159
## Melanesia                                    NA           NA              1

I initialized a flow matrix with all subregions as rows and columns which contained only 0s, treating rows as origin subregions and columns as destination subregions.

I then updated the values in flow_matrix with the ones in subregions. This approach ensured that all possible combinations of subregions were present in flow_matrix, even if they were absent from the subregions data frame. An absent combination indicates that no migration was recorded between those two subregions.

#find all subregions in the dataset
unique_subreg = unique(c(data$subregion_to, data$subregion_from))

#initialize flow_matrix with 0s (rows = origin, columns = destination), as described above
flow_matrix = matrix(0, 
                     nrow = length(unique_subreg), 
                     ncol = length(unique_subreg),
                     dimnames = list(unique_subreg, unique_subreg)
)

#update flow_matrix with values from subregions
for(i in unique_subreg) { #take each unique subregion
  for(j in unique_subreg) { #combine it with all subregions
    if(i %in% rownames(subregions) && j %in% colnames(subregions) && #if the combination exists in subregions
       !is.na(subregions[i, j])) { #and its value is not missing
      flow_matrix[i, j] = subregions[i, j] #replace the 0 in flow_matrix with the subregions value
    } #otherwise keep 0 in flow_matrix
  }
}
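
As an alternative to filling the matrix cell by cell, the same square matrix can be built in one step with base R's xtabs(). This is only a sketch on made-up data (the data frame toy and its values are hypothetical), not the approach used in the full script. Giving both columns the same factor levels is what guarantees that every subregion appears as a row and as a column, with 0 for absent combinations.

```r
#toy long-format flows (hypothetical values)
toy = data.frame(subregion_from = c("A", "A", "B", "C"),
                 subregion_to   = c("A", "B", "A", "A"),
                 number         = c(10, 3, 7, 2))

#use the same levels for rows and columns so the matrix is square
lev = sort(unique(c(toy$subregion_from, toy$subregion_to)))
toy$subregion_from = factor(toy$subregion_from, levels = lev)
toy$subregion_to   = factor(toy$subregion_to, levels = lev)

#sum flows per origin (rows) and destination (columns); absent pairs become 0
toy_matrix = unclass(xtabs(number ~ subregion_from + subregion_to, data = toy))
toy_matrix
```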

At the end, the flow matrix looked like this:

head(flow_matrix)
##                    Eastern Asia South-Eastern Asia Sub-Saharan Africa
## Eastern Asia                  0                  0                  0
## South-Eastern Asia            0                  0                  0
## Sub-Saharan Africa            0                  0                  0
## Northern Africa               0                  0                  0
## Southern Europe               0                  0                  0
## Northern America              0                  0                  0
##                    Northern Africa Southern Europe Northern America
## Eastern Asia                     0               0              527
## South-Eastern Asia               0               0              488
## Sub-Saharan Africa               0               0             3849
## Northern Africa                  0               0             1895
## Southern Europe                  0             956              459
## Northern America                 0               1             1218
##                    Latin America and the Caribbean Western Asia
## Eastern Asia                                     1            0
## South-Eastern Asia                               1            2
## Sub-Saharan Africa                               1          125
## Northern Africa                                  0          228
## Southern Europe                                128         1570
## Northern America                                 8          534
##                    Australia and New Zealand Southern Asia Eastern Europe
## Eastern Asia                             756             0             32
## South-Eastern Asia                      1086             0              4
## Sub-Saharan Africa                      2828             0              2
## Northern Africa                           67             0             12
## Southern Europe                           94             0            110
## Northern America                         784             0              7
##                    Northern Europe Western Europe
## Eastern Asia                   336            539
## South-Eastern Asia             819            571
## Sub-Saharan Africa            5931           2112
## Northern Africa               5622           8771
## Southern Europe               4800          14367
## Northern America               172            463

The subregion details data frame

This data frame included colours for circle sectors and circle links, as well as the total flow of migrants (immigrants + emigrants) in each subregion. The code for this section is largely adapted from Sander et al. (2014).

I added the number of emigrants, immigrants, and total migrants for each subregion to the newly created subregion_details dataframe.

#Compute number of emigrants per subregion 
df_from = data %>%
  group_by(subregion_from) %>%
  summarize(emig = sum(number))

#Compute number of immigrants per subregion
df_to = data %>%
  group_by(subregion_to) %>%
  summarize(immig = sum(number))

#create subregion_details data frame with info about total migration flow
subregion_details = left_join(df_from, 
                              df_to, 
                              by = c("subregion_from" = "subregion_to") 
)

##I am aware I could have done this by summing the rows and columns of `flow_matrix`
##but I wanted to do things this way so that I could compare the two outputs 
##and hopefully find they are identical as a way to check my work (they were!). 
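
The chunk above stops at the join, while later steps rely on the subregion and total columns. One way these could be derived, together with the cross-check against the flow matrix mentioned in the comment, is sketched below in base R. Everything here is an assumption for illustration: the toy values, the names toy, toy_matrix and details, and the choice to keep every subregion and treat a missing count as 0.

```r
#toy flow matrix: rows = origin, columns = destination (hypothetical values)
toy_matrix = matrix(c(10, 3, 0,
                       7, 0, 0,
                       2, 0, 0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

#the same flows in long format, similar in shape to the joined dataset
toy = as.data.frame(as.table(toy_matrix), stringsAsFactors = FALSE)
names(toy) = c("subregion_from", "subregion_to", "number")
toy = toy[toy$number > 0, ]

#emigrants per origin and immigrants per destination
df_from = aggregate(number ~ subregion_from, data = toy, FUN = sum)
df_to   = aggregate(number ~ subregion_to, data = toy, FUN = sum)
names(df_from) = c("subregion", "emig")
names(df_to)   = c("subregion", "immig")

#keep every subregion, then turn missing counts into 0 before summing
details = merge(df_from, df_to, by = "subregion", all = TRUE)
details$emig[is.na(details$emig)]   = 0
details$immig[is.na(details$immig)] = 0
details$total = details$emig + details$immig

#cross-check against the row and column sums of the matrix
all(details$emig  == rowSums(toy_matrix)[details$subregion])
all(details$immig == colSums(toy_matrix)[details$subregion])
```

Both checks return TRUE when the two routes to the totals agree.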

Because circular plots offer limited space, I wanted to remove subregions with few migrants from the dataset, while still giving the user the choice of how many subregions to include. In my case, I excluded subregions whose total number of migrants fell in the bottom 20%.

#identify subregions where the total number of migrants is below the given quantile
(tiny_subreg = subset(subregion_details, 
                      total < quantile(total, 0.2) #user-defined quantile
))

#remove tiny subregions from subregion_details
subregion_details = subregion_details[!(subregion_details$subregion %in% tiny_subreg$subregion), ]
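
To make the cut-off easier to change, the two steps above could be wrapped in a small helper. This is a sketch only: the function drop_small() and the toy values are hypothetical and not part of the original script.

```r
#drop subregions whose total falls below the q-th quantile (q chosen by the user)
drop_small = function(details, q = 0.2) {
  details[details$total >= quantile(details$total, q), ]
}

#toy data (hypothetical values)
toy_details = data.frame(subregion = c("A", "B", "C", "D", "E"),
                         total     = c(5, 50, 500, 5000, 50000))

drop_small(toy_details)$subregion        #default: only "A" falls below the 20% quantile
drop_small(toy_details, q = 0)$subregion #q = 0 keeps every subregion
```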

I then assigned colours to each remaining subregion after sorting them in ascending order of the total number of migrants. This process runs automatically, regardless of how many of the 17 subregions the user selects.

#add rgb codes to each subregion
rgb_pool =  c("255,0,0", #red
              "0,255,0", #lime
              "128,128,0", #olive   
              "148,0,211", #dark violet
              "0,206,209", #dark turquoise
              "255,0,255", #magenta
              "128,0,0", #maroon
              "255,99,71", #tomato
              "0,128,0", #green
              "0,0,255", #blue
              "128,0,128", #purple
              "0,128,128", #teal
              "0,0,128", #navy
              "250,128,144", #salmon
              "100,149,237", #cornflower blue
              "153,50,204", #dark orchid
              "60,179,113" #medium sea green
) #googled 17 rgb codes that enhance contrast; 17 = length(unique_subreg)

#select as many colours as needed depending on the amount of subregions included
subregion_details$rgb = rgb_pool[1:nrow(subregion_details)]

I then stored two versions of these colours in HEX format (one of which had increased transparency) for different elements of the graph. To accomplish this, I split colour codes into individual variables, and then used the rgb function to increase transparency and convert them into HEX codes.

#Split rgb codes into 3 variables
n = nrow(subregion_details)
subregion_details = cbind(subregion_details, #split codes and treat them as numbers
                          matrix(as.numeric(unlist(strsplit(subregion_details$rgb, split = ","))), 
                                 nrow = n, byrow = TRUE 
                                 ) #arrange them in a matrix with n rows and 3 columns (r, g, b)
)

subregion_details = subregion_details %>%
  rename( #rename columns according to the colour index
    r = '1',
    g = '2',
    b = '3',
  )

#transform into HEX codes
subregion_details$rcol = rgb(subregion_details$r, 
                             subregion_details$g, 
                             subregion_details$b, 
                             maxColorValue = 255
)

#same colours with increased transparency
subregion_details$lcol = rgb(subregion_details$r, 
                             subregion_details$g, 
                             subregion_details$b, 
                             alpha = 200, #transparency index
                             maxColorValue = 255
)
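
For readers who want to see the conversion in isolation, here is a small worked example in base R using two of the codes from rgb_pool. The object names codes and parts are made up for this sketch; it names the matrix columns directly instead of renaming them afterwards.

```r
#two of the codes from rgb_pool
codes = c("255,0,0", "148,0,211")

#split each code and stack the parts into a numeric matrix with named columns
parts = do.call(rbind, strsplit(codes, split = ","))
parts = matrix(as.numeric(parts), ncol = 3, dimnames = list(NULL, c("r", "g", "b")))

#opaque HEX codes
rcol = rgb(parts[, "r"], parts[, "g"], parts[, "b"], maxColorValue = 255)
#semi-transparent HEX codes: alpha = 200 out of 255 is appended as "C8"
lcol = rgb(parts[, "r"], parts[, "g"], parts[, "b"], alpha = 200, maxColorValue = 255)

rcol #"#FF0000" "#9400D3"
lcol #"#FF0000C8" "#9400D3C8"
```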

I also sorted the rows of subregion_details in ascending order of the total number of migrants to facilitate indexing later on, and added the columns xmin = 0 and xmax = subregion_details$total, which demarcate the axis limits in the plot (from 0 to the total number of migrants) for each subregion.
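
The code for this step is not shown here; a minimal base R sketch of what it could look like follows. The data frame toy_details and its values are hypothetical.

```r
#toy details table (hypothetical values)
toy_details = data.frame(subregion = c("A", "B", "C"),
                         total     = c(500, 20, 3000))

#sort rows by ascending total and reset the row names
toy_details = toy_details[order(toy_details$total), ]
rownames(toy_details) = NULL

#axis limits for each sector: from 0 to the subregion's total
toy_details$xmin = 0
toy_details$xmax = toy_details$total
toy_details
```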

At the end, the subregion_details data frame looked like this:

head(subregion_details)
##            subregion  emig immig total order       rgb   r   g   b    rcol
## 1       Eastern Asia  2191     0  2191     5   255,0,0 255   0   0 #FF0000
## 2 South-Eastern Asia  2971     0  2971     6   0,255,0   0 255   0 #00FF00
## 3 Sub-Saharan Africa 14848     0 14848     7 128,128,0 128 128   0 #808000
## 4    Northern Africa 16595     0 16595     8 148,0,211 148   0 211 #9400D3
## 5    Southern Europe 22484  1085 23569     9 0,206,209   0 206 209 #00CED1
## 6   Northern America  3187 24303 27490    10 255,0,255 255   0 255 #FF00FF
##        lcol xmin  xmax
## 1 #FF0000C8    0  2191
## 2 #00FF00C8    0  2971
## 3 #808000C8    0 14848
## 4 #9400D3C8    0 16595
## 5 #00CED1C8    0 23569
## 6 #FF00FFC8    0 27490

Data Visualization

The data visualization I selected is a circular plot; you can find the code here.

Background information

Essentially, this plot draws tracks on a circle and splits them into sectors to reflect differences between subregions in migrant numbers. It also plots links to illustrate the flow of migrants between subregions.

Readers who are interested in the inner workings of the code should consult Sander et al. (2014) and Gu (2020). I used both of these resources, but I prefer the latter because it offers a reasonably in-depth explanation of the circlize package.

Preliminary circular plots

I started by setting some plotting parameters related to the size of the circle, padding, sector gaps, etc.

suppressPackageStartupMessages(library(circlize))
library(stringr) #provides str_length(), used below to position labels

circos.clear() #reset circular layout parameters

par(mar = c(0, 0, 0, 0)) #margin around chart
circos.par(cell.padding = c(0, 0, 0, 0), 
           track.margin = c(0, 0.1), 
           start.degree = 45, #start plotting at 2 o'clock
           gap.degree = 2, #gap between circle sectors
           points.overflow.warning = FALSE, 
           canvas.xlim = c(-1.3, 1.3), #size of circle
           canvas.ylim = c(-1.3, 1.3)  #size of circle
)

I then initialized the layout to allocate subregions into sectors whose sizes are bounded by xmin and xmax. This approach ensures the relative size of sectors matches the relative difference in migration for each subregion.

circos.initialize(factors = subregion_details$subregion, #allocate sectors on circle to subregions
                  xlim = cbind(subregion_details$xmin, 
                               subregion_details$xmax) #set limits of the x axis for each sector between 0 and total = xmax
)
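
To illustrate why the x-axis limits control sector size, the arithmetic can be reproduced by hand. The sketch below assumes, as I understand circlize's default behaviour, that the degrees left after subtracting the gaps are shared out in proportion to each sector's x-axis range. The totals are made up.

```r
#totals for three hypothetical sectors
xmax = c(A = 1000, B = 3000, C = 6000)
gap  = 2 #degrees between neighbouring sectors, as in gap.degree above

#degrees left for the sectors once the gaps are taken out
available = 360 - gap * length(xmax)

#each sector gets a share proportional to its x-axis range (xmax - 0)
sector_degrees = available * xmax / sum(xmax)
sector_degrees
```

Under this assumption, a subregion with six times as many migrants gets an arc six times as wide.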

The next step involved creating a plotting region to which I added graphics.

  • I plotted subregion names using circos.text()
  • I plotted a first track using circos.rect(), split into one sector per subregion to reflect the total number of migrants (emigrants + immigrants).

This initial work resulted in:

circos.initialize(factors = subregion_details$subregion, #allocate sectors on circle to subregions
                  xlim = cbind(subregion_details$xmin, 
                               subregion_details$xmax) #set limits of the x axis for each sector between 0 and total = xmax
)

options(scipen = 10) #prevent scientific notation on plot

circos.trackPlotRegion(ylim = c(0, 1), #y-axis limits for each sector
                       factors = subregion_details$subregion, 
                       track.height = 0.1, 
                       panel.fun = function(x, y) { #for each new cell (i.e., intersection between sector and track)
                         name = get.cell.meta.data("sector.index") #retrieve cell meta data
                         i = get.cell.meta.data("sector.numeric.index")
                         xlim = get.cell.meta.data("xlim")
                         ylim = get.cell.meta.data("ylim")
                         
                         #plot subregion names
                         circos.text(x = mean(xlim), #position text at middle of sector
                                     y = ifelse(str_length(name) > 25, 4.5, 
                                                ifelse(str_length(name) > 20, 4, 
                                                       ifelse(str_length(name) >= 14, 3.3, 3))
                                                ), #distance from plot depending on length of character
                                     labels = name, #name of subregion
                                     facing = "clockwise", 
                                     niceFacing = TRUE,
                                     cex = 0.5, #scale text
                                     col = subregion_details$rcol[i]
                         )
                         
                         #plot a sector for each subregion
                         circos.rect(xleft = xlim[1], 
                                     ybottom = ylim[1], 
                                     xright = xlim[2], 
                                     ytop = ylim[2], 
                                     col = subregion_details$rcol[i], #use less transparent colours
                                     border = subregion_details$rcol[i]
                         )
        
                       }
)
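
The nested ifelse() calls that set the label distance can also be written as a lookup, which may be easier to adjust. This is a sketch rather than the original code: label_offset() is a hypothetical helper, and it uses base R's nchar() in place of str_length().

```r
#distance of the label from the track, chosen by the length of the name
label_offset = function(name) {
  #breaks at 14, 21 and 26 characters reproduce the >= 14, > 20 and > 25 tests
  c(3, 3.3, 4, 4.5)[findInterval(nchar(name), c(14, 21, 26)) + 1]
}

label_offset(c("Southern Asia", "Northern Europe",
               "Australia and New Zealand", "Latin America and the Caribbean"))
```

The four names have 13, 15, 25 and 31 characters, so they map to 3, 3.3, 4 and 4.5 respectively.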

This plot shows the relative number of total migrants (immigrants + emigrants) for each subregion. I next added another track where:

  • the coloured portions represent relative numbers of emigrants per subregion
  • the white portions represent relative numbers of immigrants per subregion.

This was achieved by including the following code inside the panel.fun() function in the previous code chunk.

                         #distinguish between immigrants and emigrants in each subregion
                         circos.rect(xleft = xlim[1], 
                                     ybottom = ylim[1], 
                                     xright = xlim[2] - rowSums(flow_matrix)[i], #i.e., total - emigrants
                                     ytop = ylim[1] + 0.3,
                                     col = "white", 
                                     border = "white"
                         ) 
                         
                         #add a white contour to separate the previous two rectangles
                         circos.rect(xleft = xlim[1], 
                                     ybottom = 0.3, 
                                     xright = xlim[2], 
                                     ytop = 0.32, 
                                     col = "white", 
                                     border = "white"
                         )

Final plot

Next, I included links between origin and destination subregions to show migration patterns, and an axis to indicate actual migrant numbers. This required some further data processing to transform flow_matrix into its long format and to add two parameters that guide the position of links, sum1 and sum2.

#plot links for each combination of regions
for(k in seq_len(nrow(flow_matrix_long))){ #for each row of the long-format flow matrix
  i = match(flow_matrix_long$subregion_from[k],
            subregion_details$subregion) #get plotting details for subregion of origin
  j = match(flow_matrix_long$subregion_to[k],
            subregion_details$subregion) #get plotting details for destination subregion
  
  circos.link(sector.index1 = subregion_details$subregion[i], #sector where the link starts
              point1 = c(subregion_details$sum1[i], 
                         subregion_details$sum1[i] + abs(flow_matrix[i, j])), #starting point of link
              
              sector.index2 = subregion_details$subregion[j], #sector where the link ends
              point2 = c(subregion_details$sum2[j], 
                       subregion_details$sum2[j] + abs(flow_matrix[i, j])), #endpoint of link
              
              border = subregion_details$lcol[i],
              col = subregion_details$lcol[i] #use the more transparent colour to increase visibility
  )
  
  #update sum1 and sum2 to move along the circle into the next sector
  subregion_details$sum1[i] = subregion_details$sum1[i] + abs(flow_matrix[i, j]) 
  subregion_details$sum2[j] = subregion_details$sum2[j] + abs(flow_matrix[i, j])
}
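
The loop above relies on flow_matrix_long and on the sum1 and sum2 columns, whose construction is not shown. A sketch of how they could be prepared follows, mirroring the layout used in Sander et al.-style plots. The toy values and the column name flow are assumptions. The starting values follow from the second track: the white immigrant section runs from 0 to the column sum, so outgoing links start at the column sum and incoming links start at 0.

```r
#toy flow matrix: rows = origin, columns = destination (hypothetical values)
toy_matrix = matrix(c(10, 3,
                       7, 0),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("A", "B"), c("A", "B")))

#long format: one row per origin-destination pair
toy_long = as.data.frame(as.table(toy_matrix), stringsAsFactors = FALSE)
names(toy_long) = c("subregion_from", "subregion_to", "flow") #"flow" is a made-up name
toy_long = toy_long[toy_long$flow > 0, ] #no link needed for empty flows

#sum1: where outgoing links start (end of the immigrant section)
#sum2: where incoming links start (beginning of the immigrant section)
toy_details = data.frame(subregion = rownames(toy_matrix),
                         sum1 = unname(colSums(toy_matrix)),
                         sum2 = 0)
toy_details
```

Each pass through the loop then advances sum1 for the origin and sum2 for the destination by the size of the flow, so successive links sit side by side instead of overlapping.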

As a result, I produced this plot, which is linked here as a PNG file because the R output in this package looks horrendous.

The colour of each link indicates its origin. Notice that migration patterns are plotted as links that start in the emigrants section of a subregion (the second coloured arc inwards) and end in the immigrants section of another (or the same) subregion (the white arc continuing from the emigrant arc). The width of a link gives the relative number of migrants moving from one subregion to another.

For example, the migration flow in Southern Asia consists exclusively of emigrants, most of whom go to Northern Europe, with far fewer going to Australia and New Zealand or Northern America. As we can see, immigrants from Southern Asia make up a sizeable share of all immigrants in Northern Europe (the blue ribbon corresponds to about 25,000 immigrants in Northern Europe out of a total of about 75,000, as demarcated by the white stripe of the sector).

Summary

Findings show that:

Reflection

If I had more time to spend on this project:

Resources used

To complete this project, I used several resources:

You can find my repository here.